Detailed Summary: Asynchronous Methods for Deep Reinforcement Learning (A3C)
Citation: Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, Koray Kavukcuoglu. Google DeepMind / Montreal Institute for Learning Algorithms (MILA). ICML 2016. arXiv:1602.01783v2.
Abstract (Paraphrased)
This paper presents a conceptually simple and lightweight framework for deep RL using asynchronous gradient descent for optimization of deep neural network controllers. The key idea is running multiple actor-learners in parallel on independent environment instances, using their diverse experience to decorrelate training data without experience replay. Four asynchronous algorithms are proposed: one-step Q-learning, one-step Sarsa, n-step Q-learning, and Asynchronous Advantage Actor-Critic (A3C). A3C surpasses state-of-the-art on Atari using half the training time of prior work, running on a single multi-core CPU rather than a GPU.
Motivation
The Non-Stationarity Problem in Online Deep RL
When a single agent interacts sequentially with an environment, consecutive observations (s_t, a_t, r_t, s_{t+1}) are highly correlated. Training a deep neural network directly on such correlated sequences produces unstable, divergent learning because the data distribution shifts as the policy changes.
The dominant solution before this paper was experience replay (used in DQN): store transitions in a large replay memory, then sample random mini-batches to break temporal correlations. This works but has three drawbacks:
- Requires storing millions of transitions (high memory).
- Requires more computation per real environment step.
- Can only be applied to off-policy algorithms (Q-learning, DQN) because sampled transitions may come from a different policy than the current one.
This restriction prevented using on-policy algorithms (Sarsa, actor-critic, REINFORCE) with deep networks.
The Insight: Parallelism as a Decorrelator
If multiple agents independently interact with separate environment instances simultaneously, at any given moment the agents are experiencing different states. The resulting training data is naturally diverse. This eliminates the need for a replay buffer entirely and allows both on-policy and off-policy deep RL to be trained stably.
An additional benefit: wall-clock training time scales (near-linearly for some methods) with the number of parallel workers, since more environment interactions occur per unit time.
RL Background
Standard Formulation
- At each timestep t, agent observes state s_t, selects action a_t ~ pi(a_t|s_t), receives reward r_t, transitions to s_{t+1}.
- Return from time t: R_t = sum_{k=0}^{inf} gamma^k * r_{t+k}, discount factor gamma in (0,1].
- Goal: maximize E[R_t] from each state s_t.
Value Functions
- Action-value: Q^pi(s,a) = E[R_t | s_t=s, a_t=a]
- Optimal: Q*(s,a) = max_pi Q^pi(s,a)
- State-value: V^pi(s) = E[R_t | s_t=s]
One-Step Q-Learning Loss
L_i(theta_i) = E[(r + gamma * max_{a'} Q(s',a';theta_{i-1}) - Q(s,a;theta_i))^2]
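A minimal PyTorch sketch of this loss; q_net, target_net, and the tensor shapes are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def one_step_q_loss(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    """L(theta) = (r + gamma * max_a' Q(s', a'; theta^-) - Q(s, a; theta))^2."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a; theta)
    with torch.no_grad():                                       # target network theta^- held fixed
        y = r + gamma * target_net(s_next).max(dim=1).values * (1.0 - done)  # y = r when terminal
    return F.mse_loss(q_sa, y)

# Toy usage: 4-dim states, 3 actions, batch of 8 transitions.
q_net, target_net = nn.Linear(4, 3), nn.Linear(4, 3)
s, s_next = torch.randn(8, 4), torch.randn(8, 4)
a, r, done = torch.randint(0, 3, (8,)), torch.randn(8), torch.zeros(8)
loss = one_step_q_loss(q_net, target_net, s, a, r, s_next, done)
```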
Policy Gradient (REINFORCE)
REINFORCE updates the policy parameters in the direction nabla_theta log pi(a_t|s_t;theta) * R_t, which is an unbiased estimate of nabla_theta E[R_t].
Variance-reduced version using baseline b_t(s_t) approx V^pi(s_t):
gradient = nabla_theta log pi(a_t|s_t;theta) * (R_t - b_t(s_t))
The quantity R_t - b_t(s_t) can be seen as an estimate of the advantage A(a_t, s_t) = Q(a_t, s_t) - V(s_t) of action a_t in state s_t: how much better the action is than the policy's average in that state.
This leads to the actor-critic architecture: actor = policy pi, critic = value function V^pi (the baseline).
Asynchronous RL Framework (Section 4)
Two Core Design Choices
Choice 1: Multi-thread on a single machine. Unlike Gorila (which uses 100 separate actor-learner processes and 30 parameter servers across many machines), A3C runs all workers as threads on a single multi-core CPU. This eliminates inter-machine communication cost and allows lock-free Hogwild!-style parameter sharing via shared memory.
Choice 2: Diverse exploration policies per thread. Each thread samples its exploration parameter epsilon from a distribution, creating naturally diverse data across workers. This further reduces temporal correlation.
Four Algorithms
Asynchronous One-Step Q-Learning (Algorithm 1)
Each thread:
- Takes action with an epsilon-greedy policy based on Q(s,a;theta) (consistent with the gradient below, which is taken with respect to the shared theta).
- Computes gradient: d(y - Q(s,a;theta))^2 / d theta, where y = r (terminal) or r + gamma * max_{a'} Q(s',a';theta^-) (non-terminal).
- Accumulates gradient into d_theta.
- Asynchronously updates global theta using d_theta every I_AsyncUpdate steps.
- Periodically syncs target network theta^- <- theta every I_target steps.
Asynchronous One-Step Sarsa
Same as above but uses on-policy target: y = r + gamma * Q(s',a';theta^-) where a' is the next action taken by the policy (not the max).
Asynchronous n-Step Q-Learning (Algorithm S2)
Operates in the forward view: accumulate up to t_max steps of experience, then compute multi-step return:
R = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * max_a Q(s_{t+n},a;theta^-)
Compute gradient for each state-action pair in the window, accumulate, then apply in one async update.
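A plain-Python sketch of the backward recursion that produces these multi-step targets for every step in the window; the variable names are illustrative:

```python
def n_step_q_targets(rewards, bootstrap_q_max, terminal, gamma=0.99):
    """Multi-step targets for (s_t, a_t), ..., (s_{t+n-1}, a_{t+n-1}).

    rewards:         [r_t, ..., r_{t+n-1}] collected over the window (n <= t_max)
    bootstrap_q_max: max_a Q(s_{t+n}, a; theta^-); ignored if the window ended at a terminal state
    """
    R = 0.0 if terminal else bootstrap_q_max
    targets = []
    for r in reversed(rewards):      # latest reward first: R = r_i + gamma * R
        R = r + gamma * R
        targets.append(R)
    return list(reversed(targets))   # targets[i] pairs with (s_{t+i}, a_{t+i})

# e.g. n_step_q_targets([1.0, 0.0, 1.0], bootstrap_q_max=2.0, terminal=False)
# -> approximately [3.9207, 2.9502, 2.98] with gamma = 0.99
```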
Asynchronous Advantage Actor-Critic — A3C (Algorithm S3)
The headline algorithm. Maintains two networks sharing most parameters:
- Actor: policy pi(a_t|s_t; theta)
- Critic: value function V(s_t; theta_v)
Per-thread pseudocode:
repeat:
    Reset gradients: d_theta = 0, d_theta_v = 0
    Sync local params: theta' = theta, theta_v' = theta_v
    t_start = t
    Get state s_t
    repeat:
        Select a_t ~ pi(a_t|s_t; theta')
        Receive r_t and new state s_{t+1}
        t = t + 1; T = T + 1
    until s_t is terminal or t - t_start == t_max
    if s_t is terminal: R = 0
    else:               R = V(s_t; theta_v')      // bootstrap from the last state
    for i = t-1, ..., t_start:
        R = r_i + gamma * R
        d_theta   += nabla_{theta'} log pi(a_i|s_i; theta') * (R - V(s_i; theta_v'))
        d_theta_v += d(R - V(s_i; theta_v'))^2 / d theta_v'
    Async update of global theta using d_theta, and of global theta_v using d_theta_v
until T > T_max
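A condensed PyTorch sketch of one iteration of this per-thread loop. The ActorCritic interface (returning action logits and a value), the Gym-style env, the grad hand-off idiom, and the value_coef weight are assumptions for illustration, not the paper's implementation; the optimizer is assumed to be built over global_model.parameters() (shared RMSProp in the paper).

```python
import torch

def a3c_rollout_update(global_model, local_model, optimizer, env, state,
                       t_max=5, gamma=0.99, beta=0.01, value_coef=0.5):
    """One t_max-step rollout followed by one asynchronous update of the global parameters."""
    local_model.load_state_dict(global_model.state_dict())     # theta' <- theta, theta_v' <- theta_v

    log_probs, values, rewards, entropies = [], [], [], []
    done = False
    for _ in range(t_max):                                      # collect up to t_max transitions
        logits, value = local_model(state)                      # state assumed to be a ready tensor
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                                  # a_t ~ pi(.|s_t; theta')
        state, reward, done, _ = env.step(action.item())        # Gym-style (obs, reward, done, info)
        log_probs.append(dist.log_prob(action))
        values.append(value)
        rewards.append(reward)
        entropies.append(dist.entropy())
        if done:
            break                                               # caller resets env before the next call

    # Bootstrap R = V(s_t; theta_v') from the last state unless the episode terminated.
    R = torch.zeros(1) if done else local_model(state)[1].detach()

    policy_loss, value_loss = 0.0, 0.0
    for i in reversed(range(len(rewards))):                     # backward in time: R = r_i + gamma * R
        R = rewards[i] + gamma * R
        advantage = R - values[i]
        value_loss = value_loss + advantage.pow(2)              # (R - V(s_i; theta_v'))^2
        policy_loss = policy_loss - log_probs[i] * advantage.detach() - beta * entropies[i]
    loss = (policy_loss + value_coef * value_loss).sum()

    local_model.zero_grad()
    loss.backward()                                             # gradients accumulate in the local copy
    for lp, gp in zip(local_model.parameters(), global_model.parameters()):
        gp.grad = lp.grad                                       # hand d_theta, d_theta_v to global params
    optimizer.step()                                            # lock-free update of global theta, theta_v
    return state, done
```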
Entropy regularization: The full policy gradient objective includes an entropy bonus:
nabla_{theta'} log pi(a_t|s_t;theta') * (R_t - V(s_t;theta_v')) + beta * nabla_{theta'} H(pi(s_t;theta'))
where H is the entropy of the policy distribution and beta=0.01. This discourages premature convergence to deterministic policies.
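For a softmax policy the entropy is simply H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s); a small worked sketch (the logits values are arbitrary):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 0.5, 0.1, 0.1])        # unnormalized action preferences for one state
probs = F.softmax(logits, dim=-1)
entropy = -(probs * probs.log()).sum()              # H(pi(s)); maximal (log 4 ~ 1.386) for a uniform policy
bonus = 0.01 * entropy                               # beta * H(pi(s)) added to the policy objective
```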
System Architecture
Shared Memory (Global Parameters)
+-------------------------------------------+
| theta     (policy network weights)        |
| theta_v   (value network weights)         |
| g         (shared RMSProp moving average) |
| T         (global step counter)           |
+-------------------------------------------+
        ^               ^                 ^
        |               |                 |
    Thread-1        Thread-2    ...   Thread-N
  +-----------+   +-----------+     +-----------+
  | Env copy  |   | Env copy  |     | Env copy  |
  | theta'    |   | theta'    |     | theta'    |
  | theta_v'  |   | theta_v'  |     | theta_v'  |
  +-----------+   +-----------+     +-----------+
Each thread: collects t_max steps, computes gradients, asynchronously updates global theta and theta_v.
Each thread maintains a thread-local copy of parameters (theta', theta_v'). Parameters are synchronized from the global at the start of each t_max-step rollout. Gradients from the rollout are applied asynchronously to the global — no locking, no coordination.
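A sketch of the corresponding Hogwild!-style setup in PyTorch, where the global network lives in shared memory and each worker process updates it without locks. train_worker is a stub standing in for the per-thread loop sketched earlier; the network and learning rate are placeholders.

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn

def train_worker(rank, global_model, optimizer):
    # Each worker would create its own environment copy and local model, then repeatedly
    # run the t_max-step rollout/update loop sketched above. Left as a stub here.
    pass

def launch(num_workers=16):
    global_model = nn.Linear(4, 2)                  # placeholder for the actor-critic network
    global_model.share_memory()                     # parameters placed in shared memory (Hogwild! style)
    # NOTE: a stock RMSprop keeps its moving-average statistics per process; the paper's
    # Shared RMSProp also places those statistics (g) in shared memory.
    optimizer = torch.optim.RMSprop(global_model.parameters(), lr=1e-3, alpha=0.99)

    workers = [mp.Process(target=train_worker, args=(rank, global_model, optimizer))
               for rank in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

if __name__ == "__main__":
    launch()
```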
Neural Network Architecture
For Atari experiments:
- Conv layer: 16 filters, 8x8, stride 4, ReLU
- Conv layer: 32 filters, 4x4, stride 2, ReLU
- Fully connected: 256 hidden units, ReLU
- Actor head: softmax over actions (one output per action)
- Critic head: single linear output (V(s))
- All non-output layers shared between actor and critic
- LSTM variant: adds 256 LSTM cells after the last fully connected layer
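A PyTorch sketch of the feedforward Atari variant; layer sizes follow the list above, while the class name and the flattened-feature size (9x9x32 for 84x84 inputs) are my own working-out:

```python
import torch
import torch.nn as nn

class AtariActorCritic(nn.Module):
    def __init__(self, num_actions, in_channels=4):                        # 4 stacked 84x84 grayscale frames
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 16, kernel_size=8, stride=4)   # 84x84 -> 20x20
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)            # 20x20 -> 9x9
        self.fc = nn.Linear(32 * 9 * 9, 256)
        self.policy_head = nn.Linear(256, num_actions)                     # actor: softmax over these logits
        self.value_head = nn.Linear(256, 1)                                # critic: scalar V(s)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.fc(x.flatten(start_dim=1)))
        return self.policy_head(x), self.value_head(x)                     # shared trunk, two heads
```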
For MuJoCo continuous control:
- Encoder: two convolutional layers -> 128 LSTM cells (for pixel input); or single FC layer with 200 ReLU units (for physical state input)
- Actor head: mean vector mu (linear) + variance sigma^2 (SoftPlus)
- Policy samples from N(mu, sigma^2 * I)
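For the continuous-control head, a minimal sketch of the linear-mean / SoftPlus-variance parameterization; class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class GaussianPolicyHead(nn.Module):
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.mu = nn.Linear(hidden_dim, action_dim)                                      # mean vector
        self.sigma_sq = nn.Sequential(nn.Linear(hidden_dim, action_dim), nn.Softplus())  # variance > 0

    def forward(self, h):
        mu, sigma_sq = self.mu(h), self.sigma_sq(h)
        dist = torch.distributions.Normal(mu, sigma_sq.sqrt())      # diagonal N(mu, sigma^2 * I)
        action = dist.sample()
        return action, dist.log_prob(action).sum(-1), dist.entropy().sum(-1)
```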
Optimizer: Shared RMSProp
Standard non-centered RMSProp update:
g = alpha*g + (1-alpha)*delta_theta^2
theta = theta - eta * delta_theta / sqrt(g + epsilon)
Three variants compared:
- Momentum SGD: per-thread momentum vector.
- RMSProp: per-thread moving average g.
- Shared RMSProp: g shared across all threads, updated asynchronously without locking.
Shared RMSProp is consistently more robust to learning rate and initialization variance than the other two. All experiments use Shared RMSProp with alpha=0.99.
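A sketch of the non-centered update above applied in place to a parameter vector; the statistics tensor g is the one shared across threads, and the names and learning rate are illustrative:

```python
import torch

def shared_rmsprop_step(theta, grad, g, lr=1e-3, alpha=0.99, eps=1e-8):
    """g = alpha*g + (1 - alpha)*grad^2 ; theta = theta - lr * grad / sqrt(g + eps)."""
    g.mul_(alpha).addcmul_(grad, grad, value=1 - alpha)    # update the shared moving average in place
    theta.addcdiv_(grad, (g + eps).sqrt(), value=-lr)      # apply the parameter step in place

# Toy usage: the same g tensor would be reused (without locking) by every thread.
theta, g = torch.zeros(10), torch.zeros(10)
shared_rmsprop_step(theta, torch.randn(10), g)
```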
Evaluation
Setup
- Hardware: Single machine with 16 CPU cores; no GPU.
- Comparison baseline: DQN trained on a single Nvidia K40 GPU (8 days).
- Atari: 57 games in the Arcade Learning Environment (ALE). Human starts evaluation metric.
- TORCS: 3D racing simulator, 4 speed/bot configurations.
- MuJoCo: Rigid body physics, multiple locomotion/manipulation tasks.
- Labyrinth: Randomly generated 3D mazes, visual input only.
- Hyperparameter search: 50 random seeds per game; learning rate sampled from LogUniform(1e-4, 1e-2); all other hyperparameters fixed.
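Sampling from LogUniform(1e-4, 1e-2) just means drawing the exponent uniformly and exponentiating, e.g.:

```python
import numpy as np

rng = np.random.default_rng(0)
# 50 learning-rate candidates from LogUniform(1e-4, 1e-2): uniform in log-space, then exp.
learning_rates = np.exp(rng.uniform(np.log(1e-4), np.log(1e-2), size=50))
```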
Key Results
Atari 57 Games (Table 1, Human-Start Metric)
| Method | Training | Mean (%) | Median (%) |
|---|---|---|---|
| DQN | 8 days GPU | 121.9 | 47.5 |
| Gorila | 4 days, 100 machines | 215.2 | 71.3 |
| Double DQN | 8 days GPU | 332.9 | 110.9 |
| Dueling Double DQN | 8 days GPU | 343.8 | 117.1 |
| Prioritized DQN | 8 days GPU | 463.6 | 127.6 |
| A3C FF (1 day) | 1 day CPU | 344.1 | 68.2 |
| A3C FF | 4 days CPU | 496.8 | 116.6 |
| A3C LSTM | 4 days CPU | 623.0 | 112.6 |
Running on CPUs only, A3C FF surpasses Double DQN's median score after 4 days of training (half the training time), and A3C LSTM achieves the highest mean score of all listed methods.
Scalability (Table 2): Training Speedup vs. Number of Threads
| Method | 1 thread | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| 1-step Q | 1.0 | 3.0 | 6.3 | 13.3 | 24.1 |
| 1-step Sarsa | 1.0 | 2.8 | 5.9 | 13.1 | 22.1 |
| n-step Q | 1.0 | 2.7 | 5.9 | 10.7 | 17.2 |
| A3C | 1.0 | 2.1 | 3.7 | 6.9 | 12.5 |
The one-step methods achieve superlinear speedups (e.g., 24.1x with 16 threads), suggesting that parallelism improves learning efficiency and not just data throughput. The n-step methods and A3C scale sublinearly (plausibly because multi-step returns yield higher-variance gradients), with A3C scaling worst, yet A3C delivers the best final performance.
Robustness (Fig. 2)
Scatter plots of final scores over 50 learning rates and random initializations show large "good performance" regions with no zero-score collapses — the algorithm does not diverge.
TORCS
A3C reaches 75–90% of human tester score across four configurations (slow/fast car, with/without bots) in ~12 hours of training.
MuJoCo
A3C solves all tested tasks within 24 hours using CPU; typical solution time is a few hours.
Labyrinth (3D Maze)
A3C LSTM reaches average score ~50 on randomly generated mazes using only 84x84 pixel visual input, demonstrating general exploration strategy learning.
RL Formulation Table
| Component | A3C Specification |
|---|---|
| Agent | Actor-critic neural network (shared CNN/LSTM backbone) |
| Environment | Atari ALE / TORCS / MuJoCo / Labyrinth; one copy per thread |
| State (s_t) | Last 4 game frames (84x84 grayscale) for Atari; RGB frame / physical state for others |
| Action (a_t) | Discrete: sampled from the softmax policy output; Continuous: sampled from N(mu, sigma^2 * I) |
| Reward (r_t) | Game score delta (clipped to [-1,+1] for Atari); velocity reward for TORCS |
| Return (R) | n-step discounted return R = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * V(s_{t+n}; theta_v') |
| Advantage | A(a_t,s_t) = R - V(s_t; theta_v') |
| Policy update | nabla_{theta'} log pi(a_t\|s_t;theta') * (R - V(s_t;theta_v')) + beta * nabla_{theta'} H(pi(s_t;theta')) |
| Value update | Minimize (R - V(s_t;theta_v'))^2 |
| Algorithm | On-policy policy gradient with advantage baseline (actor-critic) |
| Optimizer | Shared RMSProp, alpha=0.99, initial lr ~ LogUniform(1e-4, 1e-2) |
| Discount gamma | 0.99 |
| t_max | 5 (for Atari); episode length for MuJoCo |
| Entropy beta | 0.01 |
| Workers | 16 CPU threads |
Limitations
Stale gradients: Threads apply gradients computed from parameters that may be several updates old. Empirically stable, but no convergence guarantees with non-linear function approximation.
Data efficiency: Without replay, each transition is used exactly once. Off-policy methods with replay can reuse transitions many times, potentially learning faster per environment interaction.
Thread scaling for A3C: Speedup is sublinear (12.5x at 16 threads vs. 24.1x for 1-step Q), plausibly because multi-step advantage estimates have higher gradient variance.
Hyperparameter sensitivity (relative): Although more robust than DQN, A3C still requires tuning the learning rate, t_max, and the entropy coefficient beta (the epsilon exploration schedule applies only to the value-based asynchronous variants).
No prioritization: Unlike Prioritized DQN, A3C does not focus on high-surprise transitions; in sparse reward environments this may slow learning.
Related Work
- DQN (Mnih et al., 2015): Experience replay + target network; foundational GPU-based deep RL.
- Gorila (Nair et al., 2015): Distributed DQN on 100 machines + 30 parameter servers; high infrastructure cost.
- Double DQN (Van Hasselt et al., 2015): Reduces Q-value overestimation; orthogonal to A3C.
- Prioritized Experience Replay (Schaul et al., 2015): Can be incorporated into A3C via replay buffer.
- TRPO (Schulman et al., 2015a): Trust region policy optimization; stronger convergence guarantees but more complex.
- Hogwild! (Recht et al., 2011): Lock-free asynchronous SGD; theoretical basis for A3C's update mechanism.
- Tsitsiklis (1994): Convergence of asynchronous Q-learning; theoretical underpinning.
Relevance to DynamICCL
High direct relevance. A3C is a foundational algorithm that directly informs DynamICCL's Config Agent design:
Direct algorithm alternative to DQN: DynamICCL's Agent-2 currently uses DQN. A3C (particularly the LSTM variant) is a strong alternative for the following reasons:
- DynamICCL's NCCL configuration environment is non-stationary (congestion fluctuates); A3C's on-policy, replay-free design adapts more quickly to shifting distributions.
- A3C with LSTM naturally captures temporal patterns in congestion signals, complementing Agent-1's CUSUM detection.
- A3C's actor-critic structure explicitly separates "how good is this state" (V(s)) from "how good is this action" (A(s,a)), which is valuable for DynamICCL because collective completion time is a noisy, state-dependent signal.
Multi-worker parallelism for NCCL config exploration: DynamICCL operates in a cluster environment where multiple nodes run collective operations simultaneously. Each node could host a separate A3C worker exploring NCCL configurations, sharing a global policy — this is precisely the A3C framework applied to the DynamICCL training problem.
Entropy regularization for exploration: NCCL's configuration space has 7 algorithms * 3 protocols * 8 channel counts = 168 discrete combinations (more once additional tunables are included). A3C's entropy bonus encourages exploration of this discrete space, preventing premature convergence to a locally-optimal but globally suboptimal configuration.
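Purely as an illustration of the scale of this action space (the option lists below are hypothetical stand-ins, not DynamICCL's or NCCL's actual values):

```python
import itertools
import math

algorithms = [f"algo_{i}" for i in range(7)]        # 7 hypothetical algorithm choices
protocols = [f"proto_{i}" for i in range(3)]        # 3 hypothetical protocol choices
channel_counts = [1, 2, 4, 8, 12, 16, 24, 32]       # 8 hypothetical channel-count settings

actions = list(itertools.product(algorithms, protocols, channel_counts))
print(len(actions))            # 7 * 3 * 8 = 168 discrete configurations
print(math.log(len(actions)))  # ~5.12 nats: the entropy of a uniform policy over this space
```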
Advantage function for variance reduction: Collective completion times have high variance due to network jitter and background traffic. The advantage function A(s,a) = R - V(s) subtracts a state-dependent baseline, reducing gradient variance — critical for DynamICCL's noisy reward signal.
CPU-only training on Chameleon Cloud: DynamICCL trains on bare-metal HPC nodes. A3C's single-machine multi-thread CPU-based training eliminates the need for dedicated GPU training infrastructure during the RL training phase, leaving GPU resources free for the actual collective workloads being optimized.
LSTM for temporal state: Agent-1 (Trigger Agent) uses LSTM+CUSUM. An A3C-LSTM Config Agent could jointly maintain temporal state about recent collective performance, correlating past congestion signals with configuration outcomes — a more unified architecture than two separate agents.